Field of the Invention
This invention relates to cleaning, transforming, integrating, and deduplicating data from multiple data sources. Products and services embodying the invention operate in markets including data cleaning, record deduplication, data integration, data quality, and data transformation.
Background
Systems such as those provided by Informatica, Oracle's Silver Creek Systems, and IBM InfoSphere QualityStage are used to integrate data coming from different data sources, standardize data formats (e.g., dates and addresses), and remove errors from data (e.g., duplicates). These systems typically depend on a data expert (i.e., a person who has knowledge about the semantics of the data) to manually specify low-level procedures to clean the data. Whether the resulting data integration plan is efficient and effective depends mainly on the skill of the data expert. The audience targeted by such systems is assumed to be extremely familiar with the data (e.g., experienced in data analytics).
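By way of illustration only, the following minimal Python sketch suggests what such manually specified low-level procedures might look like. It is not taken from any of the systems named above. The list of date layouts, the field names `name` and `dob`, and the functions `standardize_date`, `normalize_name`, and `deduplicate` are all hypothetical; they stand in for rules that a data expert would have to write by hand from knowledge of the sources.

```python
from datetime import datetime

# Hypothetical: date layouts the data expert knows occur in the sources.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def standardize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next hand-listed layout
    return None


def normalize_name(raw):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(raw.lower().split())


def deduplicate(records):
    """Keep the first record seen for each (name, date of birth) key."""
    seen = set()
    unique = []
    for rec in records:
        # Hypothetical matching rule chosen by the expert.
        key = (normalize_name(rec["name"]), standardize_date(rec["dob"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Illustrative records only; the second duplicates the first.
records = [
    {"name": "Ann  Lee", "dob": "1980-03-07"},
    {"name": "ann lee", "dob": "03/07/1980"},
    {"name": "Bob Ray", "dob": "07.03.1980"},
]
cleaned = deduplicate(records)
```

Every rule in the sketch (which date layouts to accept, which fields identify a duplicate) encodes knowledge about the data, so the quality of the result is bounded by what the person writing the rules knows.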
Modern data integration problems, on the other hand, impose stricter requirements. The system operator should not have to be a data expert. Also, the quality and the efficiency of the system should not depend on the expertise of the system operator(s).